Summary of Assignment1


Chosen expression dataset from GEO (GSE139242)
contact_address contact_city contact_country contact_department contact_email contact_institute
Kirkeveien 166, Laboratoriebygget OSLO Norway Department of Medical Genetics OUS

Initial Defined Groups
  1. cell type (CD4, CD8)
  2. tissues( thymus, bloodinfant, bloodadult)
  3. patients( or Individuals numbered from 1 to 5)
celltype tissue patient
CD4_thymus_1 CD4 thymus 1
CD4_thymus_2 CD4 thymus 2
CD4_thymus_3 CD4 thymus 3
CD4_thymus_4 CD4 thymus 4
CD4_bloodinfant_1 CD4 bloodinfant 1
CD4_bloodinfant_2 CD4 bloodinfant 2
CD4_bloodinfant_3 CD4 bloodinfant 3
CD4_bloodinfant_4 CD4 bloodinfant 4
CD4_bloodadult CD4 bloodadult NA
CD8_thymus_1 CD8 thymus 1
CD8_thymus_2 CD8 thymus 2
CD8_thymus_3 CD8 thymus 3
CD8_thymus_4 CD8 thymus 4
CD8_thymus_5 CD8 thymus 5
CD8_bloodinfant_1 CD8 bloodinfant 1
CD8_bloodinfant_2 CD8 bloodinfant 2
CD8_bloodinfant_3 CD8 bloodinfant 3
CD8_bloodadult CD8 bloodadult NA

Compare the distribution of pre and post normalization


MDS plots based on different groups


BCV Plot


MeanVar Plot

## 
## The downloaded binary packages are in
##  /var/folders/4w/5ghh1f910_zfbnfv6h5rst9w0000gn/T//Rtmpn2JddV/downloaded_packages

Assignment 2


Preprocessing

A2 updated defined Groups

This is the short glaze of the newly defined-group table after refine the samples from 17 to 15.

celltype tissue patient
CD4_thymus_1 CD4 thymus 1
CD4_thymus_2 CD4 thymus 2
CD4_thymus_3 CD4 thymus 3
CD4_thymus_4 CD4 thymus 4
CD4_bloodinfant_1 CD4 bloodinfant 1

This is a short glaze of what we start with:

The 15 samples are kept, The cell-type category is composed of CD4 and CD8, and there are 8 CD4 samples and 7 CD8 samples in the dataset. 2 samples of “bloodadult” tissue type was removed.

ensembl_gene_id hgnc_symbol CD4_thymus_1 CD4_thymus_2 CD4_thymus_3
ENSG00000000003 ENSG00000000003 TSPAN6 2.750699 4.395522 1.618822
ENSG00000000419 ENSG00000000419 DPM1 37.402365 31.998891 36.502047
ENSG00000000457 ENSG00000000457 SCYL3 18.826214 20.721749 23.585623
ENSG00000000460 ENSG00000000460 C1orf112 34.115816 23.810148 31.870986
ENSG00000000938 ENSG00000000938 FGR 3.268688 4.664636 2.780002

Create data_matrix based on new samples

Part1. Differential Gene Expression

1. Calculate p-values for each of the genes in the expression set.

Three MDS plot based on new selected samples


2. Multiple Hypothesis Testing: Benjamini-Hochberg correction

In order to correct the p-values, I decided to go with the BH correction method, this method is used in class. I tried two model design in this assignment the first one is the simple model regarding the cell types. In the paper, the expriment focus on the CD4 and CD8 T cells. This would be an obvious appraoch of designing the Data Model.


Model Design #1

The simple version linear model that only take cell type as the variable

(Intercept) samples$celltypeCD8
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 0
1 1
1 1
1 1
1 1
1 1
1 1
1 1

Ranked genes with Top p-value

ensembl_gene_id hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
9800 ENSG00000172116 CD8B 280.216105 140.008071 20.23733 0 1.00e-07 3.425836
6043 ENSG00000138835 RGS3 41.228723 26.976475 12.48887 0 2.48e-05 2.632282
7408 ENSG00000153563 CD8A 493.908635 236.540282 11.87527 0 3.26e-05 2.511982
13955 ENSG00000237506 RPSAP15 -2.117976 1.160512 -11.52623 0 3.65e-05 2.436896
14138 ENSG00000242071 RPL7AP6 -6.380788 3.853027 -11.26642 0 3.69e-05 2.377555

Short summary of # of genes that pass the thresholds

Gene pass the threshold p-value < 0.05 from Model #1 (Cell type only)
4545
Genes pass correction with adj.P-value < 0.05 (Cell type only)
1583
Gene pass the threshold p-value < 0.01 from Model #1 (Cell type only)
2158
Genes pass correction with adj.P-value < 0.01 (Cell type only)
540

Model Design #2

The complex version model that takes cell type and tissue type as its variables.

To create a different linear model with a little bit more complexity. I used two arguments, one is the previous selected cell type and the other from the MDS plot I selected the tissues.

(Intercept) samples$celltypeCD8 samples$tissuethymus
1 0 1
1 0 1
1 0 1
1 0 1
1 0 0
1 0 0
1 0 0
1 0 0
1 1 1
1 1 1
1 1 1
1 1 1
1 1 0
1 1 0
1 1 0

Ranked genes with Top p-value

This is the resulting table with ranked genes with Top p-value

ensembl_gene_id hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
9800 ENSG00000172116 CD8B 278.623826 140.008071 21.51238 0 1.00e-07 4.884591
7408 ENSG00000153563 CD8A 486.339551 236.540282 15.09044 0 4.50e-06 4.271212
6043 ENSG00000138835 RGS3 41.451709 26.976475 13.07518 0 1.59e-05 3.913343
2600 ENSG00000106028 SSBP1 -46.072426 85.107571 -12.97505 0 1.59e-05 3.892086
13955 ENSG00000237506 RPSAP15 -2.103834 1.160512 -11.91344 0 3.79e-05 3.641291

Short summary of # of genes that pass the thresholds

Gene pass the threshold p-value < 0.05 from Model #2 (cell type + tissue)
6894
Genes pass correction from Model #2 (cell type + tissue)
4512
Gene pass the threshold p-value < 0.01 from Model #1 (Cell type only)
3960
Genes pass correction with adj.P-value < 0.01 (Cell type only)
1651

Compare the results from the two design models

The above plot shows the P value of genes between the two models. Model #1 with only cell-type is colored in Orange, Model #2 with cell type + tissue is colored in Blue.

By reading this graph, we can conclude that the Multiple Parameter Data Model works better for my dataset, and I will keep using the second model for the future study.


Using QLF method within EdgeR package

Calculate differential expression using the Quasi liklihood model with coef=“cell-type”

logFC logCPM F PValue FDR
ENSG00000172116 5.110761 7.478149 526.0195 0 0e+00
ENSG00000103569 6.253015 3.031245 471.4154 0 0e+00
ENSG00000256039 7.246702 6.229868 390.9360 0 0e+00
ENSG00000153563 6.602412 8.250343 346.2413 0 0e+00
ENSG00000138835 2.783237 5.138695 341.3913 0 0e+00
ENSG00000203747 6.606881 5.622688 308.5814 0 0e+00
ENSG00000181036 4.843257 3.666595 288.4766 0 0e+00
ENSG00000128294 1.969806 6.064547 261.2659 0 1e-07
ENSG00000172543 4.174503 6.433384 242.9130 0 1e-07
ENSG00000186407 6.631176 3.924736 231.4641 0 1e-07
x
BH
x
samples$celltypeCD8
x
glm

Short summary of genes that pass the threshold

Total genes
16205
Gene pass the threshold p-value < 0.05 using edgeR
5546
Genes pass correction with QLF model with p-value < 0.05
3775
Gene pass the threshold p-value < 0.01 using edgeR
3658
Genes pass correction with QLF model with p-value < 0.01
2079
Genes p-value < 0.05 and abs(logFC) >= 1
2006


Estimate the dispersion of my data

Using the MeanVar Plot to explore the mean-variance relationship. Only the trended dispersion is used under the quasilikelihood pipeline.

Calculate dispersion to observe variance deviates from the mean. Biological Coefficient of Variation (BCV) and dispersion is a measure of how much variation there in the samples.


3. Highlight genes of interest with MA Plot or a Volcano plot

Here is a summary of the number of Up regulated and down regulated genes:

Up Regulated Genes
3524
Down Regulated Genes
2022
M A plot

Use the quasi-liklihood models to fit the data and QLF -test to test for differential expression

The MA plot shows the difference between two conditions by graphing the average intensity by log ratio

Here is several things to be mensioned in this plot:

  1. There are multiple outliners which has a significant value of the logCPM (average log expression), therefore I assign xlim and ylim to zoom in to the plot to be able to observe whether the distribution of the up and down regulated genes is as expected.

Volcano Plot

This is the volcano plot using EnhancedVolcano package, the


4. Visualize the top hits using a heatmap

HeatMaps of top hit genes using Limma Model#2 (accounting for cell-type and tissue variability)

  • With P.value < 0.05
  • Columns ordered by cell types (CD4 or CD8)

  • With adj.P.Val < 0.01
  • Columns ordered by Tissues (Thymus or Bloodinfant)

Generally we can observe the same cell type with same tissue type samples are definitely clustered in groups.

By reading the first heatmap we can observe that with in the Cell type groups, different tissue type shows differences in there gene expression pattern, with the second heatmap, within the same tissue, the gene of different cell type also express differently.


Up and down-regulated genes
up_genes <- rownames(qlf_output_hits$table)[qlf_output_hits$table$logFC >= 0 
                            & qlf_output_hits$table$FDR <= 0.05]

up_genes <- normalized_count_data[up_genes, ]

down_genes <- rownames(qlf_output_hits$table)[qlf_output_hits$table$logFC <= 0 
                            & qlf_output_hits$table$FDR <= 0.05]

down_genes <- normalized_count_data[down_genes, ]


diff_genes <- rbind(up_genes, down_genes)

# write.table(rownames(up_genes), file = "up_genes.txt", quote = F, row.names = F, col.names = F)
# write.table(rownames(down_genes), file = "down_genes.txt", quote = F, row.names = F, col.names = F)
write.table(rownames(diff_genes), file = "diff_genes.txt", quote = F, row.names = F, col.names = F)

Part2. Thresholded over-representation analysis (ORA)

Which method did you choose and why?

I used g:profiler for the thresholded gene set enrichment analysis.

What annotation data did you use and why? What version of the annotation are you using?

I chose GO:BP, KEGG and REAC as my annotation data.

GO biological process: releases/2019-07-01 KEGG: KEGG FTP Release 2019-09-30 Reactome: ensembl classes: 2019-10-2

How many genesets were returned with what thresholds? The threshold is 0.05.

Upregulated genes: 1353 go bioligical process terms, 110 KEGG terms and 103 REAC terms

Downregulated genes: 657 go bioligical process terms,22 KEGG terms and 376 REAC terms

Diff genes: 538 go bioligical process terms, 31 KEGG terms and 63 REAC terms

Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

#####Up-Regulated Genes

#####Down-Regulated Genes

Up + Down Genes (all differentially expressed genes)

References